Artificial Neural Networks (MLPs) for Tabular Data#
An artificial neural network for tabular data is usually a multi-layer perceptron (MLP): stacks of Linear layers with nonlinear activations (ReLU, GELU, …).
MLPs are a great learning tool because you can understand them end-to-end:
a forward pass is just matrix multiplications + activations
training is “just” gradient descent on a loss (via backprop)
On many real-world tabular problems, tree-based models (XGBoost/LightGBM/CatBoost) are often the strongest baseline; MLPs tend to shine when you have lots of data, learned embeddings for categorical features, or you need to combine tabular with other modalities.
Learning goals#
By the end, you should be able to:
explain how an MLP turns features into predictions
implement a 2-layer MLP in NumPy (forward + backprop)
train it with mini-batch SGD and visualize learning curves
build the same model in PyTorch and compare results
diagnose common tabular-MLP pitfalls (scaling, overfitting, LR)
Notation (quick)#
Features: \(X \in \mathbb{R}^{n\times d}\) (rows are samples)
Labels (binary): \(y \in \{0,1\}^n\)
First layer: \(z_1 = XW_1 + b_1\), \(a_1 = \mathrm{ReLU}(z_1)\)
Output logits: \(\ell = a_1W_2 + b_2\) (probability via sigmoid)
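The shapes above can be sanity-checked with a tiny NumPy sketch (the dimensions here are illustrative, not the ones used later in the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 2, 4  # samples, input features, hidden units (illustrative)

X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, 1)), np.zeros(1)

z1 = X @ W1 + b1                        # (n, h)
a1 = np.maximum(0.0, z1)                # ReLU, (n, h)
logits = a1 @ W2 + b2                   # (n, 1)
probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid, (n, 1)

print(z1.shape, logits.shape)
```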
Table of contents#
What makes tabular data special?
A tiny nonlinear dataset + why scaling matters
Baseline: logistic regression (linear boundary)
From scratch: a 2-layer MLP in NumPy
Practical: the same model in PyTorch
Compare models + diagnostics
Practical tips for real tabular data
Exercises + references
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
SEED = 42
rng = np.random.default_rng(SEED)
import warnings
torch.manual_seed(SEED)
with warnings.catch_warnings():
warnings.filterwarnings("ignore", message="CUDA initialization*", category=UserWarning)
has_cuda = torch.cuda.is_available()
if has_cuda:
torch.cuda.manual_seed_all(SEED)
device = torch.device("cuda" if has_cuda else "cpu")
device
device(type='cpu')
1) What makes tabular data special?#
Tabular data usually means:
each row is an entity (customer, transaction, patient)
columns are heterogeneous features (numeric + categorical + missing)
Compared to images/text, tabular datasets are often smaller and noisier, and the “right” inductive bias is less obvious.
For MLPs specifically, two habits matter a lot:
standardize numeric features (helps optimization)
treat categorical features carefully (often via embeddings)
2) A tiny nonlinear dataset + why scaling matters#
We’ll use a simple 2D dataset so we can visualize the decision boundary.
Even though it’s 2D, it’s still “tabular”: each row is a sample, and the two columns are features.
To make the scaling issue obvious, we’ll intentionally stretch one feature.
# Dataset
n_samples = 2000
X_raw, y = make_moons(n_samples=n_samples, noise=0.25, random_state=SEED)
# Force a scale mismatch (common in real tabular datasets)
X_raw = X_raw.astype(np.float64)
X_raw[:, 1] *= 3.0
y = y.astype(np.int64)
# Train/val/test split
X_train_raw, X_temp_raw, y_train, y_temp = train_test_split(
X_raw,
y,
test_size=0.30,
random_state=SEED,
stratify=y,
)
X_val_raw, X_test_raw, y_val, y_test = train_test_split(
X_temp_raw,
y_temp,
test_size=0.50,
random_state=SEED,
stratify=y_temp,
)
# Standardize using train split only
scaler = StandardScaler().fit(X_train_raw)
X_train = scaler.transform(X_train_raw)
X_val = scaler.transform(X_val_raw)
X_test = scaler.transform(X_test_raw)
X_train.shape, X_val.shape, X_test.shape
((1400, 2), (300, 2), (300, 2))
fig = px.scatter(
x=X_raw[:, 0],
y=X_raw[:, 1],
color=y.astype(str),
title="Raw features (note the scale mismatch)",
labels={"x": "feature_1", "y": "feature_2", "color": "class"},
)
fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.show()
X_all = scaler.transform(X_raw)
fig = px.scatter(
x=X_all[:, 0],
y=X_all[:, 1],
color=y.astype(str),
title="Standardized features (zero mean, unit variance)",
labels={"x": "z(feature_1)", "y": "z(feature_2)", "color": "class"},
)
fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.show()
def decision_boundary_figure(X2d, y, prob_fn, title, grid_n=250, pad=0.6):
X2d = np.asarray(X2d)
y = np.asarray(y)
x0_min, x0_max = X2d[:, 0].min() - pad, X2d[:, 0].max() + pad
x1_min, x1_max = X2d[:, 1].min() - pad, X2d[:, 1].max() + pad
xs = np.linspace(x0_min, x0_max, grid_n)
ys = np.linspace(x1_min, x1_max, grid_n)
xx, yy = np.meshgrid(xs, ys)
grid = np.c_[xx.ravel(), yy.ravel()]
probs = prob_fn(grid).reshape(xx.shape)
fig = go.Figure()
# Probability surface
fig.add_trace(
go.Contour(
x=xs,
y=ys,
z=probs,
zmin=0.0,
zmax=1.0,
colorscale="RdBu",
reversescale=True,
opacity=0.75,
colorbar=dict(title="P(class=1)"),
contours=dict(start=0.0, end=1.0, size=0.1),
)
)
# Decision boundary line at 0.5
fig.add_trace(
go.Contour(
x=xs,
y=ys,
z=probs,
contours=dict(start=0.5, end=0.5, size=0.5, coloring="lines"),
line=dict(color="black", width=3),
showscale=False,
)
)
# Points
fig.add_trace(
go.Scatter(
x=X2d[:, 0],
y=X2d[:, 1],
mode="markers",
marker=dict(color=y, colorscale="Viridis", size=5, opacity=0.75),
name="data",
)
)
fig.update_layout(
title=title,
xaxis_title="feature_1 (standardized)",
yaxis_title="feature_2 (standardized)",
legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
)
return fig
3) Baseline: logistic regression (linear boundary)#
Logistic regression is a linear classifier: it can only draw a single straight line in 2D.
Our dataset needs a curved boundary, so logistic regression should underfit.
log_reg = LogisticRegression(max_iter=2000, random_state=SEED)
log_reg.fit(X_train, y_train)
def eval_sklearn_binary(model, X, y):
probs = model.predict_proba(X)[:, 1]
preds = (probs >= 0.5).astype(np.int64)
return {
"acc": float(accuracy_score(y, preds)),
"logloss": float(log_loss(y, probs)),
}
baseline_metrics = {
"train": eval_sklearn_binary(log_reg, X_train, y_train),
"val": eval_sklearn_binary(log_reg, X_val, y_val),
"test": eval_sklearn_binary(log_reg, X_test, y_test),
}
baseline_metrics
{'train': {'acc': 0.86, 'logloss': 0.3129837863524164},
'val': {'acc': 0.8666666666666667, 'logloss': 0.29755669561639714},
'test': {'acc': 0.89, 'logloss': 0.27348769453997146}}
fig = decision_boundary_figure(
X_train,
y_train,
prob_fn=lambda X: log_reg.predict_proba(X)[:, 1],
title="Logistic regression decision boundary (linear)",
)
fig.show()
4) From scratch: a 2-layer MLP in NumPy#
A 2-layer MLP is:
a linear layer that mixes the input features
a nonlinearity (ReLU)
another linear layer to produce a logit
Even this small network can produce a piecewise-linear decision boundary that bends around the data.
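You can see the piecewise-linear behavior directly in 1D. With a couple of ReLU units the output is a bent line whose slope changes at each unit's kink (the weights below are hand-picked for illustration, not learned):

```python
import numpy as np

# A tiny 1D "network": f(x) = relu(x) - 2 * relu(x - 1).
# It is linear with slope 1 on [0, 1] and slope -1 after the kink at x = 1.
def f(x):
    return np.maximum(0.0, x) - 2.0 * np.maximum(0.0, x - 1.0)

slope_left = (f(0.75) - f(0.25)) / 0.5   # slope on [0, 1]
slope_right = (f(1.75) - f(1.25)) / 0.5  # slope after x = 1
print(slope_left, slope_right)
```

Each ReLU contributes one kink; stacking many units (and layers) yields boundaries that bend many times, which is exactly what the moons dataset needs.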
Forward pass (binary classification)#
Hidden layer: \(z_1 = X W_1 + b_1\), \(a_1 = \mathrm{ReLU}(z_1)\)
Output logit: \(\ell = a_1 W_2 + b_2\)
Probability: \(p = \sigma(\ell) = \frac{1}{1 + e^{-\ell}}\)
Loss (binary cross-entropy, computed stably from logits): \(\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\left(\log\left(1 + e^{\ell_i}\right) - y_i \ell_i\right)\)
Key gradient fact: \(\frac{\partial \mathcal{L}}{\partial \ell_i} = \frac{1}{n}\left(\sigma(\ell_i) - y_i\right)\)
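The key gradient fact is easy to verify numerically with a finite-difference check (a quick sketch, per-sample so the \(1/n\) factor drops out):

```python
import numpy as np

def loss(logit, y):
    # Per-sample binary cross-entropy from the logit: log(1 + e^l) - y*l.
    return np.logaddexp(0.0, logit) - y * logit

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

eps = 1e-6
for logit, y in [(-2.0, 0.0), (0.5, 1.0), (3.0, 0.0)]:
    numeric = (loss(logit + eps, y) - loss(logit - eps, y)) / (2 * eps)
    analytic = sigmoid(logit) - y
    assert abs(numeric - analytic) < 1e-5
print("gradient check passed")
```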
def relu(x):
return np.maximum(0.0, x)
def sigmoid(x):
return 1.0 / (1.0 + np.exp(-x))
def bce_with_logits_loss(logits, y):
"""Mean binary cross-entropy, but computed stably from logits.
logits: (n, 1)
y: (n, 1) in {0,1}
"""
logits = np.asarray(logits)
y = np.asarray(y)
return float((np.logaddexp(0.0, logits) - y * logits).mean())
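To see why the logaddexp form matters, compare it with the naive \(-y\log p - (1-y)\log(1-p)\) at an extreme logit (a quick sketch):

```python
import numpy as np

logit, y = 40.0, 0.0  # a confidently wrong prediction

# Naive: compute the probability first, then take logs.
p = 1.0 / (1.0 + np.exp(-logit))  # rounds to exactly 1.0 in float64
with np.errstate(divide="ignore"):
    naive = -y * np.log(p) - (1 - y) * np.log(1 - p)  # log(0) -> inf

# Stable: work directly on the logit.
stable = np.logaddexp(0.0, logit) - y * logit  # ~= 40.0, no overflow

print(naive, stable)
```

The naive version blows up as soon as the sigmoid saturates, while the logit-space version stays finite and accurate.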
def accuracy_from_logits(logits, y):
probs = sigmoid(logits)
preds = (probs >= 0.5).astype(np.int64)
return float((preds.ravel() == y.ravel()).mean())
def init_mlp(in_dim, hidden_dim, rng):
"""He initialization is a good default for ReLU networks."""
W1 = rng.normal(0.0, np.sqrt(2.0 / in_dim), size=(in_dim, hidden_dim))
b1 = np.zeros((hidden_dim,), dtype=np.float64)
W2 = rng.normal(0.0, np.sqrt(2.0 / hidden_dim), size=(hidden_dim, 1))
b2 = np.zeros((1,), dtype=np.float64)
return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
def mlp_forward(X, params):
W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
z1 = X @ W1 + b1
a1 = relu(z1)
logits = a1 @ W2 + b2
cache = {"X": X, "z1": z1, "a1": a1}
return logits, cache
def mlp_loss_and_grads(X, y, params, weight_decay=0.0):
"""Return loss and gradients for a 2-layer MLP."""
y = y.reshape(-1, 1).astype(np.float64)
logits, cache = mlp_forward(X, params)
loss = bce_with_logits_loss(logits, y)
if weight_decay:
loss += 0.5 * weight_decay * (np.sum(params["W1"] ** 2) + np.sum(params["W2"] ** 2))
probs = sigmoid(logits)
n = X.shape[0]
# dL/dlogits = (sigmoid(logits) - y) / n
dlogits = (probs - y) / n
dW2 = cache["a1"].T @ dlogits
db2 = dlogits.sum(axis=0)
da1 = dlogits @ params["W2"].T
dz1 = da1 * (cache["z1"] > 0.0)
dW1 = cache["X"].T @ dz1
db1 = dz1.sum(axis=0)
if weight_decay:
dW1 = dW1 + weight_decay * params["W1"]
dW2 = dW2 + weight_decay * params["W2"]
grads = {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}
return loss, grads
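A finite-difference check is a cheap way to catch backprop bugs in code like the above. This self-contained sketch rebuilds a tiny version of the forward pass and compares the analytic W2 gradient (same formula as in mlp_loss_and_grads) against central differences; the sizes and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def loss_fn(W2_):
    a1 = np.maximum(0.0, X @ W1 + b1)
    logits = a1 @ W2_ + b2
    return (np.logaddexp(0.0, logits) - y * logits).mean()

# Analytic gradient for W2.
a1 = np.maximum(0.0, X @ W1 + b1)
logits = a1 @ W2 + b2
dlogits = (1.0 / (1.0 + np.exp(-logits)) - y) / X.shape[0]
dW2 = a1.T @ dlogits

# Numerical gradient via central differences.
eps = 1e-6
num = np.zeros_like(W2)
for i in range(W2.shape[0]):
    Wp, Wm = W2.copy(), W2.copy()
    Wp[i, 0] += eps
    Wm[i, 0] -= eps
    num[i, 0] = (loss_fn(Wp) - loss_fn(Wm)) / (2 * eps)

print(np.max(np.abs(num - dW2)))  # should be tiny
```

Checking W2 is convenient because perturbing it never crosses a ReLU kink; checking W1 works the same way but needs a little care near zero activations.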
def train_numpy_mlp(
X_train,
y_train,
X_val,
y_val,
*,
hidden_dim=32,
lr=0.1,
epochs=200,
batch_size=128,
weight_decay=1e-4,
seed=SEED,
):
rng_local = np.random.default_rng(seed)
params = init_mlp(in_dim=X_train.shape[1], hidden_dim=hidden_dim, rng=rng_local)
history = {
"epoch": [],
"train_loss": [],
"val_loss": [],
"train_acc": [],
"val_acc": [],
}
y_train_col = y_train.reshape(-1, 1)
y_val_col = y_val.reshape(-1, 1)
for epoch in range(1, epochs + 1):
idx = rng_local.permutation(X_train.shape[0])
for start in range(0, X_train.shape[0], batch_size):
batch_idx = idx[start : start + batch_size]
Xb = X_train[batch_idx]
yb = y_train_col[batch_idx]
_, grads = mlp_loss_and_grads(Xb, yb, params, weight_decay=weight_decay)
params["W1"] -= lr * grads["W1"]
params["b1"] -= lr * grads["b1"]
params["W2"] -= lr * grads["W2"]
params["b2"] -= lr * grads["b2"]
train_logits, _ = mlp_forward(X_train, params)
val_logits, _ = mlp_forward(X_val, params)
train_loss = bce_with_logits_loss(train_logits, y_train_col)
val_loss = bce_with_logits_loss(val_logits, y_val_col)
train_acc = accuracy_from_logits(train_logits, y_train_col)
val_acc = accuracy_from_logits(val_logits, y_val_col)
history["epoch"].append(epoch)
history["train_loss"].append(train_loss)
history["val_loss"].append(val_loss)
history["train_acc"].append(train_acc)
history["val_acc"].append(val_acc)
return params, history
params_np, hist_np = train_numpy_mlp(
X_train,
y_train,
X_val,
y_val,
hidden_dim=32,
lr=0.1,
epochs=200,
batch_size=128,
weight_decay=1e-4,
)
def eval_numpy_mlp(params, X, y):
logits, _ = mlp_forward(X, params)
probs = sigmoid(logits).ravel()
preds = (probs >= 0.5).astype(np.int64)
return {
"acc": float(accuracy_score(y, preds)),
"logloss": float(log_loss(y, probs)),
}
numpy_metrics = {
"train": eval_numpy_mlp(params_np, X_train, y_train),
"val": eval_numpy_mlp(params_np, X_val, y_val),
"test": eval_numpy_mlp(params_np, X_test, y_test),
}
numpy_metrics
{'train': {'acc': 0.9478571428571428, 'logloss': 0.14365309037199106},
'val': {'acc': 0.9333333333333333, 'logloss': 0.15531360103865552},
'test': {'acc': 0.95, 'logloss': 0.14805155593491454}}
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["train_loss"], name="train"))
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["val_loss"], name="val"))
fig.update_layout(
title="NumPy MLP: loss over epochs",
xaxis_title="epoch",
yaxis_title="binary cross-entropy",
)
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["train_acc"], name="train"))
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["val_acc"], name="val"))
fig.update_layout(
title="NumPy MLP: accuracy over epochs",
xaxis_title="epoch",
yaxis_title="accuracy",
yaxis=dict(range=[0.0, 1.0]),
)
fig.show()
fig = decision_boundary_figure(
X_train,
y_train,
prob_fn=lambda X: sigmoid(mlp_forward(X, params_np)[0]).ravel(),
title="NumPy MLP decision boundary (nonlinear)",
)
fig.show()
probs_np_test = sigmoid(mlp_forward(X_test, params_np)[0]).ravel()
preds_np_test = (probs_np_test >= 0.5).astype(np.int64)
cm = confusion_matrix(y_test, preds_np_test)
fig = px.imshow(
cm,
text_auto=True,
color_continuous_scale="Blues",
title="NumPy MLP: confusion matrix (test)",
labels=dict(x="predicted", y="true", color="count"),
)
fig.update_xaxes(tickmode="array", tickvals=[0, 1])
fig.update_yaxes(tickmode="array", tickvals=[0, 1])
fig.show()
5) Practical: the same model in PyTorch#
PyTorch gives you:
automatic differentiation (no manual backprop)
battle-tested optimizers (Adam, SGD+momentum)
easy batching with DataLoader
We’ll build the same architecture and train it on the same standardized data.
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train.reshape(-1, 1), dtype=torch.float32)
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val.reshape(-1, 1), dtype=torch.float32)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test.reshape(-1, 1), dtype=torch.float32)
train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=128, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val_t, y_val_t), batch_size=256, shuffle=False)
torch_model = nn.Sequential(
nn.Linear(X_train.shape[1], 32),
nn.ReLU(),
nn.Linear(32, 1),
).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(torch_model.parameters(), lr=0.03, weight_decay=1e-4)
def run_epoch(model, loader, *, train=False):
if train:
model.train()
else:
model.eval()
total_loss = 0.0
total_correct = 0.0
n = 0
for xb, yb in loader:
xb = xb.to(device)
yb = yb.to(device)
logits = model(xb)
loss = criterion(logits, yb)
if train:
optimizer.zero_grad()
loss.backward()
optimizer.step()
with torch.no_grad():
probs = torch.sigmoid(logits)
preds = (probs >= 0.5).float()
total_correct += (preds == yb).float().sum().item()
total_loss += loss.item() * xb.shape[0]
n += xb.shape[0]
return total_loss / n, total_correct / n
torch_hist = {
"epoch": [],
"train_loss": [],
"val_loss": [],
"train_acc": [],
"val_acc": [],
}
epochs = 120
for epoch in range(1, epochs + 1):
train_loss, train_acc = run_epoch(torch_model, train_loader, train=True)
val_loss, val_acc = run_epoch(torch_model, val_loader, train=False)
torch_hist["epoch"].append(epoch)
torch_hist["train_loss"].append(float(train_loss))
torch_hist["val_loss"].append(float(val_loss))
torch_hist["train_acc"].append(float(train_acc))
torch_hist["val_acc"].append(float(val_acc))
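A common refinement to a fixed-epoch loop like this is early stopping on the validation loss. A minimal patience-based sketch (the patience value is arbitrary, and checkpointing the best weights is elided):

```python
def should_stop(val_losses, patience=10, min_delta=0.0):
    """Stop when the best validation loss hasn't improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# Example: loss improves for a few epochs, then plateaus slightly worse.
losses = [0.5, 0.4, 0.35, 0.34] + [0.36] * 12
print(should_stop(losses, patience=10))
```

In the loop above you would call something like `should_stop(torch_hist["val_loss"])` at the end of each epoch and `break` when it returns True.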
@torch.no_grad()
def torch_predict_proba(model, X):
model.eval()
Xt = torch.tensor(X, dtype=torch.float32, device=device)
probs = torch.sigmoid(model(Xt)).detach().cpu().numpy().ravel()
return probs
probs_torch_test = torch_predict_proba(torch_model, X_test)
preds_torch_test = (probs_torch_test >= 0.5).astype(np.int64)
torch_metrics = {
"train": {
"acc": float(accuracy_score(y_train, (torch_predict_proba(torch_model, X_train) >= 0.5).astype(np.int64))),
"logloss": float(log_loss(y_train, torch_predict_proba(torch_model, X_train))),
},
"val": {
"acc": float(accuracy_score(y_val, (torch_predict_proba(torch_model, X_val) >= 0.5).astype(np.int64))),
"logloss": float(log_loss(y_val, torch_predict_proba(torch_model, X_val))),
},
"test": {
"acc": float(accuracy_score(y_test, preds_torch_test)),
"logloss": float(log_loss(y_test, probs_torch_test)),
},
}
torch_metrics
{'train': {'acc': 0.95, 'logloss': 0.13017641005047428},
'val': {'acc': 0.93, 'logloss': 0.1488189262504534},
'test': {'acc': 0.95, 'logloss': 0.14634598848775623}}
fig = go.Figure()
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["train_loss"], name="train"))
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["val_loss"], name="val"))
fig.update_layout(
title="PyTorch MLP: loss over epochs",
xaxis_title="epoch",
yaxis_title="binary cross-entropy",
)
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["train_acc"], name="train"))
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["val_acc"], name="val"))
fig.update_layout(
title="PyTorch MLP: accuracy over epochs",
xaxis_title="epoch",
yaxis_title="accuracy",
yaxis=dict(range=[0.0, 1.0]),
)
fig.show()
fig = decision_boundary_figure(
X_train,
y_train,
prob_fn=lambda X: torch_predict_proba(torch_model, X),
title="PyTorch MLP decision boundary (nonlinear)",
)
fig.show()
cm = confusion_matrix(y_test, preds_torch_test)
fig = px.imshow(
cm,
text_auto=True,
color_continuous_scale="Blues",
title="PyTorch MLP: confusion matrix (test)",
labels=dict(x="predicted", y="true", color="count"),
)
fig.update_xaxes(tickmode="array", tickvals=[0, 1])
fig.update_yaxes(tickmode="array", tickvals=[0, 1])
fig.show()
6) Compare models + diagnostics#
On this toy dataset, both MLPs should learn a nonlinear boundary and outperform logistic regression.
We’ll compare test accuracy and log loss (probabilistic quality).
models = ["log_reg", "numpy_mlp", "torch_mlp"]
test_acc = [
baseline_metrics["test"]["acc"],
numpy_metrics["test"]["acc"],
torch_metrics["test"]["acc"],
]
test_logloss = [
baseline_metrics["test"]["logloss"],
numpy_metrics["test"]["logloss"],
torch_metrics["test"]["logloss"],
]
fig = go.Figure(go.Bar(x=models, y=test_acc))
fig.update_layout(title="Test accuracy", xaxis_title="model", yaxis_title="accuracy", yaxis=dict(range=[0.0, 1.0]))
fig.show()
fig = go.Figure(go.Bar(x=models, y=test_logloss))
fig.update_layout(title="Test log loss (lower is better)", xaxis_title="model", yaxis_title="log loss")
fig.show()
7) Practical tips for real tabular data#
Standardize numeric features (and keep the scaler fitted on train only).
Categorical features: try learned embeddings (nn.Embedding) instead of one-hot for high-cardinality columns.
Missing values: add missingness indicators; don’t just impute and hope.
Overfitting is common: use weight decay, dropout, early stopping, and a strong validation protocol.
Learning rate often matters more than architecture. When in doubt, sweep lr and use Adam.
Baselines first: compare against logistic regression and strong tree-based models.
Calibration: optimize log loss / calibration if your probabilities will drive decisions.
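For the missing-values tip, one concrete recipe is to append a 0/1 indicator column per feature before imputing (a NumPy sketch; the helper name and mean-imputation choice are illustrative):

```python
import numpy as np

def add_missing_indicators(X):
    """Impute NaNs with the column mean and append 0/1 missingness indicators."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X_imputed = np.where(mask, col_means, X)
    return np.hstack([X_imputed, mask.astype(float)])

X = np.array([[1.0, np.nan],
              [3.0, 4.0]])
out = add_missing_indicators(X)
print(out)  # two imputed columns, then two indicator columns
```

As with the scaler, in a real pipeline you would compute the imputation statistics on the training split only and reuse them for validation/test.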
8) Exercises#
Add another hidden layer (2 hidden layers total). Does it help? Does it overfit?
Replace ReLU with tanh. What changes in training speed / final accuracy?
Implement dropout in the NumPy model.
Turn this into a multiclass problem (softmax + cross-entropy).
Try a real tabular dataset (e.g., UCI) and compare with a tree baseline.
References#
PyTorch: https://pytorch.org/docs/stable/index.html
Goodfellow, Bengio, Courville — Deep Learning (MLP + backprop chapters)
scikit-learn MLPClassifier docs (for a practical baseline)